Abstract

Several convex formulation methods have been proposed previously for statistical estimation with structured sparsity as
the prior. These methods often require a carefully tuned regularization parameter, and tuning it is often a cumbersome or
heuristic exercise. Furthermore, the estimate that these methods produce might not belong to the desired sparsity model,
albeit accurately approximating the true parameter. Therefore, greedy-type algorithms could often be more desirable in
estimating structured-sparse parameters. So far, these greedy methods have mostly focused on linear statistical models. In
this paper we study the projected gradient descent with a non-convex structured-sparse parameter model as the constraint set.
Should the cost function have a Stable Model-Restricted Hessian, the algorithm converges to the desired minimizer up to an
approximation error. As an example we elaborate on application of the main results to estimation in Generalized Linear Models.

I. Introduction 

In a variety of applications such as bioinformatics, medical imaging, social networks, and astronomy there is a growing
demand for computational methods that perform statistical inference on high-dimensional data. In the problems arising in these
applications, p, the number of predictors in each sample, is much larger than n, the number of observations. Although such
problems are generally ill-posed, in many cases the data has known underlying structure such as sparsity that can be exploited
to make the problem well-posed.

Beyond the ordinary, extensively studied, sparsity model, a variety of structured sparsity models have been proposed in the
literature [1]-[8]. These sparsity models are designed to capture the interdependence of the locations of the non-zero components
that is known a priori in certain applications. The models proposed for structured sparsity can be divided into two types. Models
of the first type have a combinatorial construction and explicitly enforce the permitted "non-zero patterns" [4], [7], [9]. Greedy
algorithms have been proposed for the least squares regression with true parameters belonging to such combinatorial sparsity
models [4], [9]. Models of the second type capture sparsity patterns induced by the convex penalty functions tailored for
specific estimation problems. For example, consistency of linear regression with mixed $\ell_1/\ell_2$-norm regularization in estimation
of group sparse signals having non-overlapping groups is studied in [1]. Furthermore, a different convex penalty to induce
group sparsity with overlapping groups is proposed in [3]. In [5], using submodular functions and their Lovász extension,
a more general framework for design of convex penalties that induce given sparsity patterns is proposed. In [8] a convex
signal model is proposed that is generated by a set of base signals called "atoms". The model can describe not only plain and
structured sparsity, but also low-rank matrices and several other low-dimensional models. We refer readers to [10], [11] for
extensive reviews on the estimation of signals with structured sparsity.

In addition to linear regression problems under structured sparsity assumptions, nonlinear statistical models have been studied
in the convex optimization framework [1], [2], [6], [12]. For example, using the signal model introduced in [8], minimization
of a convex function obeying a restricted smoothness property is studied in [12] where a coordinate-descent type of algorithm
is shown to converge to the minimizer at a sublinear rate. In this formulation and other similar methods that rely on convex
relaxation one needs to choose a regularization parameter to guarantee the desired statistical accuracy. However, choosing the
appropriate value of this parameter may be intractable. Furthermore, the convex signal models usually provide an approximation
of the ideal structures the estimates should have, while in certain tasks such as variable selection solutions are required to
exhibit the exact structure considered. Therefore, in such tasks, convex optimization techniques may yield estimates that do
not satisfy the desired structural properties, albeit accurately approximating the true parameter. These shortcomings motivate
application of combinatorial sparsity structures in nonlinear statistical models, extending prior results such as [4], [9] that have
focused exclusively on linear models.

Among the non-convex greedy algorithms, a generalization of Compressed Sensing is considered in [13] where the
measurement operator is a nonlinear map and the union of subspaces is assumed as the signal model. This formulation,
however, admits only a limited class of objective functions that are described using a norm. Furthermore, [14] proposes a
generalization of the Orthogonal Matching Pursuit algorithm [15] that is specifically designed for estimation of group sparse
parameters in Generalized Linear Models (GLMs). Also, [16] studies the problem of minimizing a generic objective function
subject to a sparsity constraint from the optimization perspective. Using certain necessary optimality conditions for the sparse
minimizer, a few iterative algorithms are proposed in [16] that converge to the sparse minimizer, should the objective satisfy
some conditions. However, that work does not address the minimization under structured sparsity.

S.B. is with the Department of Electrical and Computer Engineering at Carnegie Mellon University. 
P.B. is with Mitsubishi Electric Research Labs.

B.R. is with the Language Technologies Institute and the Department of Electrical and Computer Engineering at Carnegie Mellon University. 






In this paper we study the projected gradient descent method to approximate the minimizer of a cost function subject to a
model-based sparsity constraint. The algorithm is described in Section II. The sparsity model considered in this paper is similar
to the models in [4], [9] with minor differences in the definitions. To guarantee the accuracy of the algorithm our analysis
requires the cost function to have a Stable Model-Restricted Hessian (SMRH) as defined in Section III. Using this property
we show that for any given reference point in the considered model, each iteration shrinks the distance to the reference point
up to an approximation error. As an example, Section IV considers the cost functions that arise in Generalized Linear Models
and discusses how the proposed sufficient condition (i.e., SMRH) can be verified and how large the approximation error of
the algorithm is. To make precise statements on the SMRH and on the size of the approximation error we assume some extra
properties on the cost function and/or the data distribution. Finally, we discuss and conclude in Section V.

Notation: In the remainder of the paper we denote the positive part of a real number $x$ by $(x)_+$. For a positive integer $k$,
the set $\{1, 2, \ldots, k\}$ is denoted by $[k]$. Vectors and matrices are denoted by boldface characters and sets by calligraphic letters.
The support set (i.e., the set of non-zero coordinates) of a vector $x$ is denoted by $\mathrm{supp}(x)$. Restriction of a $p$-dimensional
vector $v$ to its entries corresponding to an index set $\mathcal{I} \subseteq [p]$ is denoted by $v|_{\mathcal{I}}$. Similarly $A_{\mathcal{I}}$ denotes the restriction of a matrix
$A$ to the rows enumerated by $\mathcal{I}$. For square matrices $A$ and $B$ we write $B \preccurlyeq A$ to state that $A - B$ is positive semidefinite.
We denote the power set of a set $\mathcal{X}$ as $2^{\mathcal{X}}$. For two non-empty families of sets $\mathcal{F}_1$ and $\mathcal{F}_2$ we write $\mathcal{F}_1 \Cup \mathcal{F}_2$ to denote another
family of sets given by $\{\mathcal{X}_1 \cup \mathcal{X}_2 \mid \mathcal{X}_1 \in \mathcal{F}_1 \text{ and } \mathcal{X}_2 \in \mathcal{F}_2\}$. Moreover, for any non-empty family of sets $\mathcal{F}$ for conciseness we
set $\mathcal{F}^j = \mathcal{F} \Cup \cdots \Cup \mathcal{F}$ where the operation $\Cup$ is performed $j - 1$ times. The inner product associated with a Hilbert space $\mathcal{H}$ is
written as $\langle \cdot, \cdot \rangle$. The norm induced by this inner product is denoted by $\|\cdot\|$. We use $\nabla f(\cdot)$ and $\nabla^2 f(\cdot)$ to denote the gradient
and the Hessian of a twice continuously differentiable function $f : \mathcal{H} \to \mathbb{R}$. For an index set $\mathcal{I} \subseteq [p]$ with $p = \dim(\mathcal{H})$, the
restriction of the gradient to the entries selected by $\mathcal{I}$ and the restriction of the Hessian to the entries selected by $\mathcal{I} \times \mathcal{I}$ are
denoted by $\nabla_{\mathcal{I}} f(\cdot)$ and $\nabla^2_{\mathcal{I}} f(\cdot)$, respectively. Finally, numerical superscripts within parentheses denote the iteration index.
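The pairwise-union operation on families of sets can be illustrated with a short Python sketch. The function names `family_union` and `family_power` and the toy family `C2` are hypothetical, chosen only for this illustration; index sets are represented as frozensets.

```python
from itertools import product

def family_union(f1, f2):
    """Return {X1 | X2 : X1 in f1, X2 in f2} as a set of frozensets."""
    return {frozenset(a) | frozenset(b) for a, b in product(f1, f2)}

def family_power(f, j):
    """Apply the pairwise-union operation j - 1 times to the family f."""
    result = {frozenset(s) for s in f}
    for _ in range(j - 1):
        result = family_union(result, f)
    return result

# Toy family: two blocks of size 2 inside [4] = {1, 2, 3, 4}.
C2 = [{1, 2}, {3, 4}]
C2_squared = family_power(C2, 2)
```

Because every set may be paired with itself, the original sets remain in the squared family alongside their pairwise unions.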

II. Problem Statement and Algorithm 

To formulate the problem of minimizing a cost function subject to structured sparsity constraints, first we provide a definition 
of the sparsity model. This definition is an alternative way of describing the Combinatorial Sparse Models in [7]. In comparison, 
our definition merely emphasizes the role of a family of index sets as a generator of the sparsity model. 

Definition 1. Suppose that $p$ and $k$ are two positive integers with $k \le p$. Furthermore, denote by $\mathcal{C}_k$ a family of some
non-empty subsets of $[p]$ that have cardinality at most $k$. The set $\bigcup_{\mathcal{S} \in \mathcal{C}_k} 2^{\mathcal{S}}$ is called a sparsity model of order $k$ generated by
$\mathcal{C}_k$ and denoted by $\mathcal{M}(\mathcal{C}_k)$.

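Remark 1. Note that if a set $\mathcal{S} \in \mathcal{C}_k$ is a subset of another set in $\mathcal{C}_k$, then the same sparsity model can still be generated after
removing $\mathcal{S}$ from $\mathcal{C}_k$ (i.e., $\mathcal{M}(\mathcal{C}_k) = \mathcal{M}(\mathcal{C}_k \setminus \{\mathcal{S}\})$). Thus, we can assume that there is no pair of distinct sets in $\mathcal{C}_k$ such that one
is a subset of the other.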
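Definition 1 and Remark 1 can be sketched in a few lines of Python. This is only an illustration under our own naming (`in_model`, `prune`, and the toy family `C3` are hypothetical): a support belongs to the model if it is contained in some generating set, and generating sets that are proper subsets of others can be dropped without changing the model.

```python
def in_model(support, generators):
    """True if `support` is a subset of some generating set S in C_k."""
    support = frozenset(support)
    return any(support <= frozenset(s) for s in generators)

def prune(generators):
    """Drop every generating set that is a proper subset of another (Remark 1)."""
    sets = {frozenset(s) for s in generators}
    return {s for s in sets if not any(s < t for t in sets)}

# Toy family with one redundant set: {2, 3} is contained in {1, 2, 3}.
C3 = [{1, 2, 3}, {2, 3}, {4, 5}]
pruned = prune(C3)
```

Membership queries against `C3` and against `pruned` agree for every support, which is the content of Remark 1.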

In this paper we aim to approximate the solution to the optimization problem

$$\arg\min_{\theta \in \mathcal{H}} f(\theta) \quad \text{s.t.} \quad \mathrm{supp}(\theta) \in \mathcal{M}(\mathcal{C}_k), \tag{1}$$

where $f : \mathcal{H} \to \mathbb{R}$ is a cost function with $\mathcal{H}$ being a $p$-dimensional real Hilbert space, and $\mathcal{M}(\mathcal{C}_k)$ a given sparsity model
described by Def. 1. To approximate a solution to (1) we use a projected gradient descent method summarized in Alg. 1. The
only difference between Alg. 1 and standard projected gradient descent methods studied in convex optimization literature is
that the projection, in line 3, is performed onto the generally non-convex set $\mathcal{M}(\mathcal{C}_k)$. The projection operator $P_{\mathcal{C}_k, r} : \mathcal{H} \to \mathcal{H}$
at any given point $\theta_0 \in \mathcal{H}$ is defined as a solution to

$$\arg\min_{\theta \in \mathcal{H}} \|\theta - \theta_0\| \quad \text{s.t.} \quad \mathrm{supp}(\theta) \in \mathcal{M}(\mathcal{C}_k) \text{ and } \|\theta\| \le r. \tag{2}$$

Remark 2. In the context of statistical estimation, the cost function $f(\cdot)$ is usually the empirical loss associated with some
observations generated by an underlying true parameter $\theta^\star$. In these problems, it is more desirable to estimate $\theta^\star$ as it describes
the data. The analysis presented in this paper allows evaluating the approximation error of the proposed algorithm with respect
to any parameter vector in the considered sparsity model, including a solution of (1) and $\theta^\star$. However, the approximation error with respect
to the statistical truth $\theta^\star$ can be simplified and interpreted to a greater extent. We elaborate more on this in Section IV.

Remark 3. Assuming that for every $\mathcal{S} \in \mathcal{C}_k$ the cost function has a unique minimum over the set
$\{\theta \mid \mathrm{supp}(\theta) \subseteq \mathcal{S} \text{ and } \|\theta\| \le r\}$, the operator $P_{\mathcal{C}_k, r}[\cdot]$ can be defined without invoking the axiom of choice because there
are only a finite number of choices for the set $\mathcal{S}$. One may also question the necessity of the constraint $\|\theta\| \le r$ in (2). As
discussed later in Section IV, in statistical estimation problems where the cost function is not quadratic the sufficient condition
we rely on cannot be guaranteed to hold unless the iterates and the true parameter lie in a bounded set. This shortcoming
is typical for convergence proofs that use similar types of conditions (cf. [17]-[20]). Finally, the exact projection onto the
sparsity model $\mathcal{M}(\mathcal{C}_k)$ might not be tractable. One may desire to show that accuracy can be guaranteed even using an inexact
projection operator, at the cost of an extra error term. Existence and complexity of algorithms that find the desired exact or
approximate projections, disregarding the length constraint in (2) (i.e., $P_{\mathcal{C}_k, +\infty}[\cdot]$), are studied in [7], [9] for several interesting
sparsity models. Also, in the general case where $r < +\infty$ one can derive a projection $P_{\mathcal{C}_k, r}[\theta]$ from $P_{\mathcal{C}_k, +\infty}[\theta]$ (see Lemma
2 in the Appendix). It is straightforward to generalize the guarantees in this paper to cases where only approximate projection
is tractable. However, we do not attempt it here; our focus is to study the algorithm when the cost function is not necessarily
quadratic. Instead, we apply the results to statistical estimation problems with non-linear models and we derive bounds on the
statistical error of the estimate.

Algorithm 1: Gradient Descent with Model Sparsity Constraint
input : $\mathcal{C}_k$, the family of possible supports,
        $r$, the radius of feasible set
  $i \leftarrow 0$, $\theta^{(0)} \leftarrow 0$
  repeat
1   Choose step-size $\eta^{(i)} > 0$
2   $x^{(i)} \leftarrow \theta^{(i)} - \eta^{(i)} \nabla f(\theta^{(i)})$
3   $\theta^{(i+1)} \leftarrow P_{\mathcal{C}_k, r}[x^{(i)}]$
4   $i \leftarrow i + 1$
  until halting condition holds
  return $\theta^{(i)}$
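A minimal Python sketch of Alg. 1 may help fix ideas. Several choices here are assumptions made for illustration and are not prescribed by the paper: index sets are 0-based, the step-size is constant, the halting condition is a fixed iteration budget, and the projection selects the generating set that captures the most energy before rescaling into the ball (the two-step construction of Lemma 2). The names `project` and `projected_gradient_descent` and the toy quadratic cost are hypothetical.

```python
import math

def project(x, generators, r):
    """Projection onto {theta : supp(theta) in M(C_k), ||theta|| <= r}.

    Keeps the entries on the generating set with the largest energy,
    zeroes the rest, then rescales into the ball of radius r.
    """
    best = max(generators, key=lambda s: sum(x[j] ** 2 for j in s))
    h = [x[j] if j in best else 0.0 for j in range(len(x))]
    norm = math.sqrt(sum(v * v for v in h))
    scale = min(1.0, r / norm) if norm > 0 else 1.0
    return [scale * v for v in h]

def projected_gradient_descent(grad, p, generators, r, step, iters=100):
    """Alg. 1 with a constant step-size and a fixed iteration budget."""
    theta = [0.0] * p                      # theta^(0) = 0
    for _ in range(iters):
        g = grad(theta)
        x = [t - step * gi for t, gi in zip(theta, g)]   # line 2
        theta = project(x, generators, r)                # line 3
    return theta

# Toy cost f(theta) = 0.5 * ||theta - target||^2 with gradient theta - target.
target = [3.0, -2.0, 0.1, 0.0]
grad = lambda th: [t - a for t, a in zip(th, target)]
generators = [{0, 1}, {2, 3}]              # 0-based index sets
theta_hat = projected_gradient_descent(grad, 4, generators, r=10.0, step=1.0)
```

For this toy cost the iterate settles on the block `{0, 1}`, which carries almost all of the energy of `target`, and the small entry outside that block is discarded by the projection.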

III. Theoretical Analysis 

A. Stable Model-Restricted Hessian 

In order to demonstrate accuracy of estimates obtained using Alg. 1 we require a variant of the Stable Restricted Hessian 
(SRH) condition proposed in [21] to hold. The SRH condition basically characterizes cost functions that have bounded curvature 
over canonical sparse subspaces. In this paper we require this condition to hold merely for the signals that belong to the 
considered model. Furthermore, we explicitly bound the length of the vectors at which the condition should hold. As will be 
discussed later, this restriction is necessary in general for non-quadratic cost functions. The condition we rely on, the Stable 
Model-Restricted Hessian (SMRH), can be formally defined as follows. 

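Definition 2. Let $f : \mathcal{H} \to \mathbb{R}$ be a twice continuously differentiable function. Furthermore, let $\alpha_{\mathcal{C}_k}$ and $\beta_{\mathcal{C}_k}$ be in turn the
smallest and largest real numbers such that

$$\beta_{\mathcal{C}_k} \|\Delta\|^2 \le \langle \Delta, \nabla^2 f(\theta) \Delta \rangle \le \alpha_{\mathcal{C}_k} \|\Delta\|^2 \tag{3}$$

holds for all $\Delta$ and $\theta$ such that $\mathrm{supp}(\Delta) \cup \mathrm{supp}(\theta) \in \mathcal{M}(\mathcal{C}_k)$ and $\|\theta\| \le r$. Then $f$ is said to have a Stable Model-Restricted
Hessian with respect to the model $\mathcal{M}(\mathcal{C}_k)$ with constant $\mu_{\mathcal{C}_k} \ge 1$ in a sphere of radius $r > 0$, or in short $(\mu_{\mathcal{C}_k}, r)$-SMRH, if

$$1 \le \alpha_{\mathcal{C}_k} / \beta_{\mathcal{C}_k} \le \mu_{\mathcal{C}_k}.$$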

Remark 4. Typically in parametric estimation problems a sample loss function $l(\theta, x, y)$ is associated with the covariate-response
pair $(x, y)$ and a parameter $\theta$. Given $n$ iid observations the empirical loss is formulated as
$L_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i)$. The estimator under study is the minimizer of the empirical loss, perhaps considering an extra regularization
or constraint for the parameter $\theta$. Most of the algorithms proposed for sparse estimation problems require that the cost function
is strongly convex over a restricted but unbounded set of directions around the true parameter $\theta^\star$. It is known, however, that
$L_n(\theta)$ as an empirical process is a good approximation of the expected loss $L(\theta) = \mathbb{E}[l(\theta, x, y)]$ (see [22] and [23, Chapter
5]). If the required sufficient condition is not satisfied by $L(\theta)$ for a valid choice of $\theta^\star$, then in general it cannot be satisfied
at the same $\theta^\star$ by $L_n(\theta)$ either. Thus, as also assumed in the prior work either explicitly [17] or implicitly [18]-[20], for a
generic sample loss it is only possible to guarantee these types of sufficient conditions if the set of valid vectors $\theta^\star$ is further
restricted, e.g., by bounding their length. This is the motivation behind the restriction imposed on the length of $\theta$ in Def. 2.
Of course, if the true parameter violates this restriction we may incur an estimation bias as quantified in Theorem 1.

B. Accuracy Guarantee 

Using the notion of SMRH we can now state the main theorem. 

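Theorem 1. Consider the sparsity model $\mathcal{M}(\mathcal{C}_k)$ for some $k \in \mathbb{N}$ and a cost function $f : \mathcal{H} \to \mathbb{R}$ that satisfies the
$(\mu_{\mathcal{C}_k^3}, r)$-SMRH condition with parameters $\alpha_{\mathcal{C}_k^3}$ and $\beta_{\mathcal{C}_k^3}$ in (3). If $\eta^\star = 2 / (\alpha_{\mathcal{C}_k^3} + \beta_{\mathcal{C}_k^3})$ then for any $\overline{\theta}$ with
$\mathrm{supp}(\overline{\theta}) \in \mathcal{M}(\mathcal{C}_k)$ and $\|\overline{\theta}\| \le r$ the iterates of Alg. 1 obey

$$\|\theta^{(i+1)} - \overline{\theta}\| \le 2\gamma^{(i)} \|\theta^{(i)} - \overline{\theta}\| + 2\eta^{(i)} \|\nabla_{\mathcal{I}} f(\overline{\theta})\|, \tag{4}$$

where $\gamma^{(i)} = \frac{\eta^{(i)}}{\eta^\star} \cdot \frac{\mu_{\mathcal{C}_k^3} - 1}{\mu_{\mathcal{C}_k^3} + 1} + \left| \frac{\eta^{(i)}}{\eta^\star} - 1 \right|$ and $\mathcal{I} = \mathrm{supp}\left( P_{\mathcal{C}_k^2, r}[\nabla f(\overline{\theta})] \right)$.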



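Remark 5. One should choose the step size to achieve a contraction factor $2\gamma^{(i)}$ that is as small as possible. Straightforward
algebra shows that the constant step-size $\eta^{(i)} = \eta^\star$ is optimal, but this choice may not be practical as the constants $\alpha_{\mathcal{C}_k^3}$ and $\beta_{\mathcal{C}_k^3}$
might not be known. Instead, we can always choose the step-size such that $1/\alpha_{\mathcal{C}_k^3} \le \eta^{(i)} \le 1/\beta_{\mathcal{C}_k^3}$ provided that the cost function
obeys the SMRH condition by setting $\eta^{(i)} = 1 / \langle \Delta, \nabla^2 f(\theta) \Delta \rangle$ for some $\Delta, \theta \in \mathcal{H}$ such that $\mathrm{supp}(\Delta) \cup \mathrm{supp}(\theta) \in \mathcal{M}(\mathcal{C}_k^3)$
(with $\|\Delta\| = 1$ and $\|\theta\| \le r$ so that (3) applies). For this choice of $\eta^{(i)}$ we have $\gamma^{(i)} \le \mu_{\mathcal{C}_k^3} - 1$.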

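Corollary 1. A fixed step-size $\eta > 0$ corresponds to a fixed contraction coefficient
$\gamma = \frac{\eta}{\eta^\star} \cdot \frac{\mu_{\mathcal{C}_k^3} - 1}{\mu_{\mathcal{C}_k^3} + 1} + \left| \frac{\eta}{\eta^\star} - 1 \right|$. In this case, assuming that $2\gamma \ne 1$, the $i$-th iterate of Alg. 1 satisfies

$$\|\theta^{(i)} - \overline{\theta}\| \le (2\gamma)^i \|\overline{\theta}\| + 2\eta \, \frac{1 - (2\gamma)^i}{1 - 2\gamma} \, \|\nabla_{\mathcal{I}} f(\overline{\theta})\|. \tag{5}$$

In particular,

(i) if $\mu_{\mathcal{C}_k^3} < 3$ and $\eta = \eta^\star = 2 / (\alpha_{\mathcal{C}_k^3} + \beta_{\mathcal{C}_k^3})$, or

(ii) if $\mu_{\mathcal{C}_k^3} < 3/2$ and $\eta \in [1/\alpha_{\mathcal{C}_k^3}, 1/\beta_{\mathcal{C}_k^3}]$,

the iterates converge to $\overline{\theta}$ up to an approximation error bounded above by $\frac{2\eta}{1 - 2\gamma} \|\nabla_{\mathcal{I}} f(\overline{\theta})\|$ with contraction factor $2\gamma < 1$.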

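Proof: Applying (4) recursively under the assumptions of the corollary and using the identity $\sum_{j=0}^{i-1} (2\gamma)^j = \frac{1 - (2\gamma)^i}{1 - 2\gamma}$
proves (5). In the first case, if $\mu_{\mathcal{C}_k^3} < 3$ and $\eta = \eta^\star = 2 / (\alpha_{\mathcal{C}_k^3} + \beta_{\mathcal{C}_k^3})$ we have $2\gamma < 1$ by definition of $\gamma$. In the second
case, one can deduce from $\eta \in [1/\alpha_{\mathcal{C}_k^3}, 1/\beta_{\mathcal{C}_k^3}]$ that $|\eta/\eta^\star - 1| \le \frac{\mu_{\mathcal{C}_k^3} - 1}{2}$ and $\eta/\eta^\star \le \frac{\mu_{\mathcal{C}_k^3} + 1}{2}$ where equalities are attained
simultaneously at $\eta = 1/\beta_{\mathcal{C}_k^3}$. Therefore, $\gamma \le \mu_{\mathcal{C}_k^3} - 1 < 1/2$ and thus $2\gamma < 1$. Finally, in both cases it immediately follows
from (5) that the approximation error converges to $\frac{2\eta}{1 - 2\gamma} \|\nabla_{\mathcal{I}} f(\overline{\theta})\|$ from below as $i \to +\infty$. ■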
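The two step-size regimes of Corollary 1 can be checked numerically. The sketch below is illustrative only: the function names `contraction` and `error_bound` and the values of `alpha` and `beta` are hypothetical, and the ratio `alpha / beta` is used in place of the upper bound $\mu$.

```python
def contraction(eta, alpha, beta):
    """Contraction coefficient gamma for a fixed step-size, with mu = alpha / beta."""
    mu = alpha / beta
    eta_star = 2.0 / (alpha + beta)
    ratio = eta / eta_star
    return ratio * (mu - 1.0) / (mu + 1.0) + abs(ratio - 1.0)

def error_bound(i, eta, gamma, theta_norm, grad_norm):
    """Right-hand side of (5) after i iterations (requires 2 * gamma != 1)."""
    q = 2.0 * gamma
    return q ** i * theta_norm + 2.0 * eta * (1.0 - q ** i) / (1.0 - q) * grad_norm

alpha, beta = 1.4, 1.0                     # hypothetical SMRH constants, mu = 1.4
gamma_opt = contraction(2.0 / (alpha + beta), alpha, beta)   # equals (mu - 1) / (mu + 1)
gamma_cons = contraction(1.0 / beta, alpha, beta)            # equals mu - 1
```

With these constants both choices give $2\gamma < 1$, and the optimal step-size yields the smaller contraction factor, in line with Remark 5.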

IV. Application in Generalized Linear Models 

Generalized Linear Models (GLMs) are among the most commonly used models for parametric estimation in a variety of
applications [24]. Linear, logistic, Poisson, and gamma models used in corresponding regression problems all belong to the family
of GLMs. Given a covariate vector $x \in \mathcal{X} \subseteq \mathbb{R}^p$ and a true parameter $\theta^\star \in \mathbb{R}^p$, the response variable $y \in \mathcal{Y} \subseteq \mathbb{R}$ in canonical
GLMs is assumed to follow an exponential family conditional distribution: $y \mid x; \theta^\star \sim Z(y) \exp\left( y \langle x, \theta^\star \rangle - \psi(\langle x, \theta^\star \rangle) \right)$,
where $Z(y)$ is a positive function, and $\psi : \mathbb{R} \to \mathbb{R}$ is the log-partition function that satisfies $\psi(t) = \log \int_{\mathcal{Y}} Z(y) \exp(ty) \, dy$ for
all $t \in \mathbb{R}$. Examples of the log-partition function include but are not limited to $\psi_{\mathrm{lin}}(t) = t^2 / (2\sigma^2)$, $\psi_{\log}(t) = \log(1 + \exp(t))$,
and $\psi_{\mathrm{Pois}}(t) = \exp(t)$ corresponding to linear, logistic, and Poisson models, respectively.

Suppose that $n$ iid covariate-response pairs $\{(x_i, y_i)\}_{i=1}^{n}$ are observed. In the Maximum Likelihood Estimation (MLE)
framework the negative log likelihood is used as a measure of the discrepancy between the true parameter $\theta^\star$ and an estimate
$\theta$ based on the observations. Formally, the average of negative log likelihoods is considered as the empirical loss

$$f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \psi(\langle x_i, \theta \rangle) - y_i \langle x_i, \theta \rangle,$$

and the MLE is performed by minimizing $f(\theta)$ over the set of feasible $\theta$. The constants c and Z that appear in the distribution
are disregarded as they have no effect on the outcome.
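For concreteness, the empirical loss and its gradient for the logistic model, where $\psi(t) = \log(1 + \exp(t))$ and $\psi'(t) = 1/(1 + \exp(-t))$, can be sketched in plain Python. The function names `logistic_loss` and `logistic_grad` are hypothetical, and data are assumed to be lists of covariate lists with responses in {0, 1}.

```python
import math

def logistic_loss(theta, xs, ys):
    """Average of psi(<x_i, theta>) - y_i <x_i, theta> with psi(t) = log(1 + e^t)."""
    total = 0.0
    for x, y in zip(xs, ys):
        t = sum(a * b for a, b in zip(x, theta))
        # log(1 + e^t) rewritten to avoid overflow for large |t|
        total += max(t, 0.0) + math.log1p(math.exp(-abs(t))) - y * t
    return total / len(xs)

def logistic_grad(theta, xs, ys):
    """Gradient (1/n) sum_i (psi'(<x_i, theta>) - y_i) x_i."""
    g = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        t = sum(a * b for a, b in zip(x, theta))
        # psi'(t) = 1 / (1 + e^{-t}) written via tanh for numerical stability
        residual = 0.5 * (1.0 + math.tanh(t / 2.0)) - y
        for j, xj in enumerate(x):
            g[j] += residual * xj / len(xs)
    return g
```

A function such as `logistic_grad` is the only model-specific ingredient that Alg. 1 needs; the projection step depends on the sparsity model alone.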



A. Verifying SMRH for GLMs 

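Assuming that $\psi(\cdot)$ is twice continuously differentiable, the Hessian of $f(\cdot)$ is equal to

$$\nabla^2 f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \psi''(\langle x_i, \theta \rangle) \, x_i x_i^{\mathrm{T}}.$$

Under the assumptions for GLMs, it can be shown that $\psi''(\cdot)$ is non-negative (i.e., $\psi(\cdot)$ is convex). For a given sparsity model
generated by $\mathcal{C}_k$ let $\mathcal{S}$ be an arbitrary support set in $\mathcal{C}_k$ and suppose that $\mathrm{supp}(\theta) \subseteq \mathcal{S}$ and $\|\theta\| \le r$. Furthermore, define

$$D_{\psi, r}(u) := \max_{t \in [-r, r]} \psi''(tu) \quad \text{and} \quad d_{\psi, r}(u) := \min_{t \in [-r, r]} \psi''(tu).$$

Using the Cauchy-Schwarz inequality we have $|\langle x_i, \theta \rangle| \le r \|x_i|_{\mathcal{S}}\|$ which implies

$$\frac{1}{n} \sum_{i=1}^{n} d_{\psi, r}(\|x_i|_{\mathcal{S}}\|) \, x_i|_{\mathcal{S}} \, x_i|_{\mathcal{S}}^{\mathrm{T}} \preccurlyeq \nabla^2_{\mathcal{S}} f(\theta) \preccurlyeq \frac{1}{n} \sum_{i=1}^{n} D_{\psi, r}(\|x_i|_{\mathcal{S}}\|) \, x_i|_{\mathcal{S}} \, x_i|_{\mathcal{S}}^{\mathrm{T}}.$$

These matrix inequalities are precursors of (3). Imposing further restriction on the distribution of the covariate vectors $\{x_i\}_{i=1}^{n}$
allows application of the results from random matrix theory regarding the extreme eigenvalues of random matrices (see e.g.,
[25] and [26]).

For example, in the logistic model where $\psi \equiv \psi_{\log}$ we can show that $D_{\psi, r}(u) = \frac{1}{4}$ and $d_{\psi, r}(u) = \frac{1}{4} \mathrm{sech}^2\left(\frac{ru}{2}\right)$. Assuming
that the covariate vectors are iid instances of a random vector whose length is almost surely bounded by one, we obtain
$d_{\psi, r}(u) \ge \frac{1}{4} \mathrm{sech}^2\left(\frac{r}{2}\right)$. Using the matrix Chernoff inequality [25] the extreme eigenvalues of $\frac{1}{n} X_{\mathcal{S}} X_{\mathcal{S}}^{\mathrm{T}}$, with $X = [x_1 \; x_2 \; \cdots \; x_n]$, can be bounded with
probability $1 - \exp(\log k - Cn)$ for some constant $C > 0$ (see [21] for detailed derivations). Using these results and taking
the union bound over all $\mathcal{S} \in \mathcal{C}_k$ we obtain bounds for the extreme eigenvalues of $\nabla^2_{\mathcal{S}} f(\theta)$ that hold uniformly for all sets
$\mathcal{S} \in \mathcal{C}_k$ with probability $1 - \exp(\log(k |\mathcal{C}_k|) - Cn)$. Thus (3) may hold if $n = O(\log(k |\mathcal{C}_k|))$.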
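The curvature bounds for the logistic model are easy to evaluate. The sketch below uses hypothetical names (`psi_dd`, `curvature_bounds`) and an arbitrary radius; it relies only on the closed form $\psi''(t) = \frac{1}{4} \mathrm{sech}^2(t/2)$, which is even and decreasing in $|t|$.

```python
import math

def psi_dd(t):
    """Second derivative of log(1 + e^t), i.e. (1/4) sech^2(t / 2)."""
    return 0.25 / math.cosh(t / 2.0) ** 2

def curvature_bounds(u, r):
    """Return (d, D): the min and max of psi''(t * u) over |t| <= r."""
    # The max is attained at t = 0 and the min at |t| = r.
    return psi_dd(r * u), 0.25

d, D = curvature_bounds(u=1.0, r=2.0)      # radius chosen only for illustration
ratio = D / d                              # grows with r, as cosh^2(r * u / 2)
```

The ratio `D / d` shows why the radius matters: as `r` grows the lower curvature bound decays and the resulting SMRH constant deteriorates.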



B. Approximation Error for GLMs 

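Suppose that the approximation error is measured with respect to $\theta^\perp = P_{\mathcal{C}_k, r}[\theta^\star]$ where $\theta^\star$ is the statistical truth in the
considered GLM. It is desirable to further simplify the approximation error bound provided in Corollary 1 which is related
to the statistical precision of the estimation problem. The corollary provides an approximation error that is proportional to
$\|\nabla_{\mathcal{T}} f(\theta^\perp)\|$ where $\mathcal{T} = \mathrm{supp}\left( P_{\mathcal{C}_k^2, r}[\nabla f(\theta^\perp)] \right)$. We can write

$$\nabla_{\mathcal{T}} f(\theta^\perp) = \frac{1}{n} \sum_{i=1}^{n} \left( \psi'(\langle x_i, \theta^\perp \rangle) - y_i \right) x_i|_{\mathcal{T}},$$

which yields $\|\nabla_{\mathcal{T}} f(\theta^\perp)\| = \left\| \frac{1}{\sqrt{n}} X_{\mathcal{T}} z \right\|$ where $X = [x_1 \; x_2 \; \cdots \; x_n]$
and $z|_{\{i\}} = z_i = \frac{\psi'(\langle x_i, \theta^\perp \rangle) - y_i}{\sqrt{n}}$. Therefore,

$$\|\nabla_{\mathcal{T}} f(\theta^\perp)\|^2 \le \left\| \tfrac{1}{\sqrt{n}} X_{\mathcal{T}} \right\|_{\mathrm{op}}^2 \|z\|^2,$$

where $\|\cdot\|_{\mathrm{op}}$ denotes the operator norm. Again using random matrix theory one can find an upper bound for $\left\| \frac{1}{\sqrt{n}} X_{\mathcal{I}} \right\|_{\mathrm{op}}$ that holds
uniformly for any $\mathcal{I} \in \mathcal{C}_k^2$ and in particular for $\mathcal{I} = \mathcal{T}$. Henceforth, $W > 0$ is used to denote this upper bound.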
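The second term in the bound can be written as

$$\|z\|^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \psi'(\langle x_i, \theta^\perp \rangle) - y_i \right)^2.$$

To further simplify this term we need to make assumptions about the log-partition function $\psi(\cdot)$ and/or the distribution of the
covariate-response pair $(x, y)$. For instance, if $\psi'(\cdot)$ and the response variable $y$ are bounded, as in the logistic model, then
Hoeffding's inequality implies that for some small $\epsilon > 0$ we have $\|z\|^2 \le \mathbb{E}\left[ \left( \psi'(\langle x, \theta^\perp \rangle) - y \right)^2 \right] + \epsilon$ with probability at least
$1 - \exp(-O(\epsilon^2 n))$. Since in GLMs the true parameter $\theta^\star$ is the minimizer of the expected loss $\mathbb{E}[\psi(\langle x, \theta \rangle) - y \langle x, \theta \rangle \mid x]$
we deduce that $\mathbb{E}[\psi'(\langle x, \theta^\star \rangle) - y \mid x] = 0$ and hence $\mathbb{E}[\psi'(\langle x, \theta^\star \rangle) - y] = 0$. Therefore,

$$\|z\|^2 \le \mathbb{E}\left[ \mathbb{E}\left[ \left( \psi'(\langle x, \theta^\perp \rangle) - \psi'(\langle x, \theta^\star \rangle) + \psi'(\langle x, \theta^\star \rangle) - y \right)^2 \,\middle|\, x \right] \right] + \epsilon$$
$$\le \mathbb{E}\left[ \left( \psi'(\langle x, \theta^\perp \rangle) - \psi'(\langle x, \theta^\star \rangle) \right)^2 \right] + \mathbb{E}\left[ \left( \psi'(\langle x, \theta^\star \rangle) - y \right)^2 \right] + \epsilon.$$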



Then it follows from Corollary 1 and the fact that HXI^^Hq <W that 





< 


0W _ e-L 


+ 















< (27)* 



2vW 2 



1-27 



2r]W 
1-27 



61 + 62- 



Note that the total approximation error is comprised of two parts. The first part is due to statistical error that is given by 
1-27 "^^taf '^^^ + '^2 is the second part of the error due to the bias that occurs because of an infeasible true parameter 

The bias vanishes if the true parameter lies in the considered bounded sparsity model (i.e., $\theta^\star = P_{\mathcal{C}_k, r}[\theta^\star]$).

V. Conclusion

We studied the projected gradient descent method for minimization of a real valued cost function defined over a finite- 
dimensional Hilbert space, under structured sparsity constraints. Using previously known combinatorial sparsity models, we 
define a sufficient condition for accuracy of the algorithm, the SMRH. Under this condition the algorithm converges to the 
desired optimum at a linear rate up to an approximation error. Unlike the previous results on greedy-type methods that merely 
have focused on linear statistical models, our algorithm applies to a broader family of estimation problems. To provide an 
example, we examined application of the algorithm in estimation with GLMs. One can verify the SMRH for a specific statistical 
model. The approximation error can also be bounded by statistical precision and the potential bias. An interesting follow-up 
problem is to find whether the approximation error can be improved and the derived error is merely a by-product of requiring 
some form of restricted strong convexity through SMRH. Another problem of interest is to study the properties of the algorithm 
when the domain of the cost function is not finite-dimensional. 



Appendix 
Proofs 



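Lemma 1. Suppose that $f$ is a twice differentiable function that satisfies (3) for a given $\theta$ and all $\Delta$ such that $\mathrm{supp}(\Delta) \cup
\mathrm{supp}(\theta) \in \mathcal{M}(\mathcal{C}_k)$. Then we have

$$\left| \langle u, v \rangle - \eta \langle u, \nabla^2 f(\theta) v \rangle \right| \le \left( \eta \, \frac{\alpha_{\mathcal{C}_k} - \beta_{\mathcal{C}_k}}{2} + \left| \eta \, \frac{\alpha_{\mathcal{C}_k} + \beta_{\mathcal{C}_k}}{2} - 1 \right| \right) \|u\| \|v\|$$

for all $\eta > 0$ and $u, v \in \mathcal{H}$ such that $\mathrm{supp}(u \pm v) \cup \mathrm{supp}(\theta) \in \mathcal{M}(\mathcal{C}_k)$.

Proof: We first prove the lemma for unit-norm vectors $u$ and $v$. Since $\mathrm{supp}(u \pm v) \cup \mathrm{supp}(\theta) \in \mathcal{M}(\mathcal{C}_k)$ we can
use (3) for $\Delta = u \pm v$ to obtain

$$\beta_{\mathcal{C}_k} \|u \pm v\|^2 \le \langle u \pm v, \nabla^2 f(\theta) (u \pm v) \rangle \le \alpha_{\mathcal{C}_k} \|u \pm v\|^2.$$

These inequalities and the assumption $\|u\| = \|v\| = 1$ then yield

$$\frac{\beta_{\mathcal{C}_k} - \alpha_{\mathcal{C}_k}}{2} + \frac{\alpha_{\mathcal{C}_k} + \beta_{\mathcal{C}_k}}{2} \langle u, v \rangle \le \langle u, \nabla^2 f(\theta) v \rangle \le \frac{\alpha_{\mathcal{C}_k} - \beta_{\mathcal{C}_k}}{2} + \frac{\alpha_{\mathcal{C}_k} + \beta_{\mathcal{C}_k}}{2} \langle u, v \rangle,$$

where we used the fact that $\nabla^2 f(\theta)$ is symmetric since $f$ is twice continuously differentiable. Multiplying all sides by $\eta$ and
rearranging the terms then imply

$$\eta \, \frac{\alpha_{\mathcal{C}_k} - \beta_{\mathcal{C}_k}}{2} \ge \left| \left( \eta \, \frac{\alpha_{\mathcal{C}_k} + \beta_{\mathcal{C}_k}}{2} - 1 \right) \langle u, v \rangle + \langle u, v \rangle - \eta \langle u, \nabla^2 f(\theta) v \rangle \right|$$
$$\ge \left| \langle u, v \rangle - \eta \langle u, \nabla^2 f(\theta) v \rangle \right| - \left| \left( \eta \, \frac{\alpha_{\mathcal{C}_k} + \beta_{\mathcal{C}_k}}{2} - 1 \right) \langle u, v \rangle \right|$$
$$\ge \left| \langle u, v \rangle - \eta \langle u, \nabla^2 f(\theta) v \rangle \right| - \left| \eta \, \frac{\alpha_{\mathcal{C}_k} + \beta_{\mathcal{C}_k}}{2} - 1 \right|, \tag{6}$$

which is equivalent to the result for unit-norm $u$ and $v$ as desired. For the general case one can write $u = \|u\| u'$ and $v = \|v\| v'$
such that $u'$ and $v'$ are both unit-norm. It is straightforward to verify that using (6) for $u'$ and $v'$ as the unit-norm vectors
and multiplying both sides of the resulting inequality by $\|u\| \|v\|$ yields the desired general case. ■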



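Proof of Theorem 1: Using optimality of $\theta^{(i+1)}$ and feasibility of $\overline{\theta}$ one can deduce

$$\|\theta^{(i+1)} - x^{(i)}\|^2 \le \|\overline{\theta} - x^{(i)}\|^2,$$

with $x^{(i)}$ in line 2 of Alg. 1. Expanding the squared norms using the inner product of $\mathcal{H}$ then shows
$0 \le \langle \theta^{(i+1)} - \overline{\theta}, 2x^{(i)} - \theta^{(i+1)} - \overline{\theta} \rangle$, or equivalently $0 \le \langle \Delta^{(i+1)}, 2\Delta^{(i)} - 2\eta^{(i)} \nabla f(\overline{\theta} + \Delta^{(i)}) - \Delta^{(i+1)} \rangle$, where $\Delta^{(i)} = \theta^{(i)} - \overline{\theta}$
and $\Delta^{(i+1)} = \theta^{(i+1)} - \overline{\theta}$. Adding and subtracting $2\eta^{(i)} \langle \Delta^{(i+1)}, \nabla f(\overline{\theta}) \rangle$ and rearranging yields

$$\|\Delta^{(i+1)}\|^2 \le 2 \langle \Delta^{(i+1)}, \Delta^{(i)} \rangle - 2\eta^{(i)} \langle \Delta^{(i+1)}, \nabla f(\overline{\theta} + \Delta^{(i)}) - \nabla f(\overline{\theta}) \rangle - 2\eta^{(i)} \langle \Delta^{(i+1)}, \nabla f(\overline{\theta}) \rangle. \tag{7}$$

Since $f$ is twice continuously differentiable by assumption, it follows from the mean-value theorem that
$\langle \Delta^{(i+1)}, \nabla f(\overline{\theta} + \Delta^{(i)}) - \nabla f(\overline{\theta}) \rangle = \langle \Delta^{(i+1)}, \nabla^2 f(\overline{\theta} + t\Delta^{(i)}) \Delta^{(i)} \rangle$, for some $t \in (0, 1)$. Furthermore, because $\overline{\theta}$,
$\theta^{(i)}$, $\theta^{(i+1)}$ all belong to the model set $\mathcal{M}(\mathcal{C}_k)$ we have $\mathrm{supp}(\overline{\theta} + t\Delta^{(i)}) \in \mathcal{M}(\mathcal{C}_k^2)$ and thereby $\mathrm{supp}(\Delta^{(i+1)}) \cup
\mathrm{supp}(\overline{\theta} + t\Delta^{(i)}) \in \mathcal{M}(\mathcal{C}_k^3)$. Invoking the $(\mu_{\mathcal{C}_k^3}, r)$-SMRH condition of the cost function and applying Lemma 1 with
the sparsity model $\mathcal{M}(\mathcal{C}_k^3)$, $\theta = \overline{\theta} + t\Delta^{(i)}$, and $\eta = \eta^{(i)}$ then yields

$$\left| \langle \Delta^{(i+1)}, \Delta^{(i)} \rangle - \eta^{(i)} \langle \Delta^{(i+1)}, \nabla f(\overline{\theta} + \Delta^{(i)}) - \nabla f(\overline{\theta}) \rangle \right| \le \gamma^{(i)} \|\Delta^{(i+1)}\| \|\Delta^{(i)}\|.$$

Using the Cauchy-Schwarz inequality and the fact that $\|\nabla_{\mathrm{supp}(\Delta^{(i+1)})} f(\overline{\theta})\| \le \|\nabla_{\mathcal{I}} f(\overline{\theta})\|$ by the definition of $\mathcal{I}$, (7)
implies that

$$\|\Delta^{(i+1)}\|^2 \le 2\gamma^{(i)} \|\Delta^{(i+1)}\| \|\Delta^{(i)}\| + 2\eta^{(i)} \|\Delta^{(i+1)}\| \|\nabla_{\mathcal{I}} f(\overline{\theta})\|.$$

Canceling $\|\Delta^{(i+1)}\|$ from both sides proves the theorem. ■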

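Lemma 2 (Bounded Model Projection). Given an arbitrary $h_0 \in \mathcal{H}$, a positive real number $r$, and a sparsity model generator
$\mathcal{C}_k$, a projection $P_{\mathcal{C}_k, r}[h_0]$ can be obtained as the projection of $P_{\mathcal{C}_k, +\infty}[h_0]$ onto the sphere of radius $r$.

Proof: To simplify the notation let $\widehat{h} = P_{\mathcal{C}_k, r}[h_0]$ and $\widehat{\mathcal{S}} = \mathrm{supp}(\widehat{h})$. For $\mathcal{S} \subseteq [p]$ define

$$h_0(\mathcal{S}) = \arg\min_{h} \|h - h_0\| \quad \text{s.t.} \quad \|h\| \le r \text{ and } \mathrm{supp}(h) \subseteq \mathcal{S}.$$

It follows from the definition of $P_{\mathcal{C}_k, r}[h_0]$ that $\widehat{\mathcal{S}} \in \arg\min_{\mathcal{S} \in \mathcal{C}_k} \|h_0(\mathcal{S}) - h_0\|$. Using

$$\|h_0(\mathcal{S}) - h_0\|^2 = \|h_0(\mathcal{S}) - h_0|_{\mathcal{S}} - h_0|_{\mathcal{S}^c}\|^2 = \|h_0(\mathcal{S}) - h_0|_{\mathcal{S}}\|^2 + \|h_0|_{\mathcal{S}^c}\|^2,$$

we deduce that $h_0(\mathcal{S})$ is the projection of $h_0|_{\mathcal{S}}$ onto the sphere of radius $r$. Therefore, we can write $h_0(\mathcal{S}) =
\min\{1, r / \|h_0|_{\mathcal{S}}\|\} \, h_0|_{\mathcal{S}}$ and from that

$$\widehat{\mathcal{S}} \in \arg\min_{\mathcal{S} \in \mathcal{C}_k} \left\| \min\{1, r / \|h_0|_{\mathcal{S}}\|\} \, h_0|_{\mathcal{S}} - h_0 \right\|^2$$
$$= \arg\min_{\mathcal{S} \in \mathcal{C}_k} \left\| \min\{0, r / \|h_0|_{\mathcal{S}}\| - 1\} \, h_0|_{\mathcal{S}} \right\|^2 + \|h_0|_{\mathcal{S}^c}\|^2$$
$$= \arg\min_{\mathcal{S} \in \mathcal{C}_k} \left( \left( 1 - r / \|h_0|_{\mathcal{S}}\| \right)_+^2 - 1 \right) \|h_0|_{\mathcal{S}}\|^2$$
$$= \arg\max_{\mathcal{S} \in \mathcal{C}_k} q(\mathcal{S}) := \|h_0|_{\mathcal{S}}\|^2 - \left( \|h_0|_{\mathcal{S}}\| - r \right)_+^2.$$

Furthermore, let

$$\mathcal{S}_0 = \mathrm{supp}\left( P_{\mathcal{C}_k, +\infty}[h_0] \right) = \arg\max_{\mathcal{S} \in \mathcal{C}_k} \|h_0|_{\mathcal{S}}\|. \tag{8}$$

If $\|h_0|_{\mathcal{S}_0}\| \le r$ then $q(\mathcal{S}) = \|h_0|_{\mathcal{S}}\|^2 \le q(\mathcal{S}_0)$ for any $\mathcal{S} \in \mathcal{C}_k$ and thereby $\widehat{\mathcal{S}} = \mathcal{S}_0$. Thus, we focus on cases that $\|h_0|_{\mathcal{S}_0}\| > r$
which implies $q(\mathcal{S}_0) = 2 \|h_0|_{\mathcal{S}_0}\| r - r^2$. For any $\mathcal{S} \in \mathcal{C}_k$ if $\|h_0|_{\mathcal{S}}\| \le r$ we have $q(\mathcal{S}) = \|h_0|_{\mathcal{S}}\|^2 \le r^2 < 2 \|h_0|_{\mathcal{S}_0}\| r - r^2 =
q(\mathcal{S}_0)$, and if $\|h_0|_{\mathcal{S}}\| > r$ we have $q(\mathcal{S}) = 2 \|h_0|_{\mathcal{S}}\| r - r^2 \le 2 \|h_0|_{\mathcal{S}_0}\| r - r^2 = q(\mathcal{S}_0)$ where (8) is applied. Therefore, we
have shown that $\widehat{\mathcal{S}} = \mathcal{S}_0$. It is then straightforward to show the desired result that projecting $P_{\mathcal{C}_k, +\infty}[h_0]$ onto the centered
sphere of radius $r$ yields $P_{\mathcal{C}_k, r}[h_0]$. ■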
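Lemma 2 can be sanity-checked numerically on a small instance. In the sketch below all names (`restrict`, `clip_to_ball`, `project_two_step`, `project_brute_force`) and the test vector are hypothetical; index sets are 0-based. The two-step projection of the lemma is compared against a direct solution of (2) that clips the restriction to every generating set and keeps the closest candidate.

```python
import math

def restrict(h, s):
    """Zero out every entry of h outside the index set s."""
    return [v if j in s else 0.0 for j, v in enumerate(h)]

def clip_to_ball(h, r):
    """Euclidean projection of h onto the centered ball of radius r."""
    n = math.sqrt(sum(v * v for v in h))
    return list(h) if n <= r else [v * r / n for v in h]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def project_two_step(h0, generators, r):
    """Lemma 2: unbounded model projection first, then clip to the ball."""
    s0 = max(generators, key=lambda s: sum(h0[j] ** 2 for j in s))
    return clip_to_ball(restrict(h0, s0), r)

def project_brute_force(h0, generators, r):
    """Solve (2) directly: best clipped restriction over all generating sets."""
    candidates = [clip_to_ball(restrict(h0, s), r) for s in generators]
    return min(candidates, key=lambda h: dist(h, h0))

h0 = [3.0, -1.0, 2.0, 0.5]
gens = [{0, 1}, {1, 2}, {2, 3}]
```

On this instance both routines attain the same distance to `h0` for small and large radii, which is what the lemma predicts.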